We describe the winning submission to the CRAC 2022 Shared Task on Multilingual Coreference Resolution. Our system first solves mention detection and then coreference linking on the retrieved spans with an antecedent-maximization approach, and both tasks are fine-tuned jointly with shared Transformer weights. We report the results of fine-tuning a wide range of pretrained models. The centerpiece of this contribution is the fine-tuned multilingual models. We found one large multilingual model with a sufficiently large encoder that improves performance across the board on all datasets, i.e., not only on the underrepresented languages or groups of typologically related languages. The source code is available at https://github.com/ufal/crac2022-corpipe.
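As a rough illustration of the architecture this abstract describes, a shared encoder can feed both a mention-detection head and a span-pair antecedent scorer so the two tasks are fine-tuned jointly. The PyTorch sketch below is not the CorPipe code; the head designs, dimensions, and the class name JointCorefHeads are illustrative assumptions.

```python
import torch
import torch.nn as nn

class JointCorefHeads(nn.Module):
    """Two heads on top of one shared Transformer encoder output (assumed interface)."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.mention_head = nn.Linear(hidden_size, 2)                # token inside / outside a mention
        self.pair_scorer = nn.Bilinear(hidden_size, hidden_size, 1)  # span-pair compatibility score

    def forward(self, hidden: torch.Tensor, span_starts: torch.Tensor):
        # hidden:      (seq_len, hidden_size) contextual embeddings from the shared encoder
        # span_starts: (num_spans,) start indices of retrieved mention spans (a simplification)
        mention_logits = self.mention_head(hidden)
        spans = hidden[span_starts]                                  # (num_spans, hidden_size)
        n = spans.size(0)
        left = spans.unsqueeze(1).expand(n, n, -1).contiguous()
        right = spans.unsqueeze(0).expand(n, n, -1).contiguous()
        # linking then selects, for each span, its highest-scoring candidate antecedent
        pair_scores = self.pair_scorer(left, right).squeeze(-1)      # (num_spans, num_spans)
        return mention_logits, pair_scores

# toy usage with random stand-ins for encoder outputs
hidden = torch.randn(50, 768)
model = JointCorefHeads()
mention_logits, pair_scores = model(hidden, torch.tensor([3, 10, 27]))
print(mention_logits.shape, pair_scores.shape)   # torch.Size([50, 2]) torch.Size([3, 3])
```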
We propose a character-based nonautoregressive GEC approach with automatically generated character transformations. Recently, per-word classification of correction edits has proven to be an efficient, parallelizable alternative to current encoder-decoder GEC systems. We show that word replacement edits may be suboptimal, leading to an explosion of rules for spelling, diacritization and other errors in morphologically rich languages, and we propose a method for generating character transformations from a GEC corpus. Finally, we train character transformation models for Czech, German and Russian, reaching solid results and a dramatic speedup compared to autoregressive systems. The source code is released at https://github.com/ufal/wnut2021_character_transformations_gec.
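As a small illustration of the difference between whole-word replacement edits and character transformations, the sketch below derives a character-level edit from an aligned (erroneous, corrected) word pair with difflib; it is not the released code, and the rule format is an assumption.

```python
from difflib import SequenceMatcher

def char_transformation(source: str, target: str):
    """Return the character-level edits that turn `source` into `target`."""
    ops = SequenceMatcher(a=source, b=target).get_opcodes()
    return [(tag, source[i1:i2], target[j1:j2])
            for tag, i1, i2, j1, j2 in ops if tag != "equal"]

# A diacritization error in Czech: the same single-character rule generalizes to many
# words, whereas a word-level replacement edit would need one rule per vocabulary item.
print(char_transformation("byt", "být"))   # [('replace', 'y', 'ý')]
```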
We present the winning entry to the Multilingual Lexical Normalization (MultiLexNorm) shared task at W-NUT 2021 (van der Goot et al., 2021a), which evaluates lexical-normalization systems on 12 social media datasets in 11 languages. We base our solution on a pre-trained byte-level language model, ByT5 (Xue et al., 2021a), which we further pre-train on synthetic data and then fine-tune on authentic normalization data. Our system achieves the best performance by a wide margin in the intrinsic evaluation, as well as the best performance in the extrinsic evaluation through dependency parsing. The source code is released at https://github.com/ufal/multilexnorm2021 and the fine-tuned models at https://huggingface.co/ufal.
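For reference, loading a byte-level ByT5 model and generating an output takes only a few lines with the Hugging Face transformers library. The sketch below uses the public google/byt5-small checkpoint rather than the fine-tuned models linked above, so it demonstrates only the plumbing, not the shared-task behaviour; the prompt is a made-up example.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

# ByT5 operates directly on UTF-8 bytes, so no language-specific subword vocabulary
# is needed -- convenient when normalizing noisy social-media text in 11 languages.
inputs = tokenizer("new yoooork", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```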
The sensitivity of deep neural models to input noise is known to be a challenging problem. In NLP, model performance often deteriorates with naturally occurring noise, such as spelling errors. To mitigate this issue, models may leverage artificially noised data. However, the amount and type of generated noise have so far been determined arbitrarily. We therefore propose to model the errors statistically from grammatical-error-correction corpora. We present a thorough evaluation of several state-of-the-art NLP systems in multiple languages, with tasks including syntactic analysis, named entity recognition, neural machine translation, a subset of the GLUE benchmark, and reading comprehension. We also compare two approaches to addressing the performance drop: a) training the NLP models on noised data generated by our framework; and b) reducing the input noise with an external system for natural language correction. The code is released at https://github.com/ufal/kazitext.
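A toy sketch of the underlying idea, statistically grounded noising, is given below: character-level error probabilities (hard-coded placeholder values here, corpus-estimated in the actual framework) are applied to clean text. This is not the KaziText implementation.

```python
import random

# assumed placeholder rates; in practice these would be estimated from a GEC corpus
ERROR_PROBS = {"swap": 0.005, "drop": 0.01, "insert": 0.005}
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def noise(text: str, rng: random.Random) -> str:
    out = []
    for ch in text:
        r = rng.random()
        if r < ERROR_PROBS["drop"]:
            continue                                   # character deletion
        out.append(ch)
        if r < ERROR_PROBS["drop"] + ERROR_PROBS["insert"]:
            out.append(rng.choice(ALPHABET))           # spurious insertion
    i = 0
    while i < len(out) - 1:                            # adjacent-character swaps
        if rng.random() < ERROR_PROBS["swap"]:
            out[i], out[i + 1] = out[i + 1], out[i]
        i += 1
    return "".join(out)

print(noise("the quick brown fox jumps over the lazy dog", random.Random(0)))
```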
Modeling lies at the core of both the financial and the insurance industry for a wide variety of tasks. The rise and development of machine learning and deep learning models have created many opportunities to improve our modeling toolbox. Breakthroughs in these fields often come with the requirement of large amounts of data. Such large datasets are often not publicly available in finance and insurance, mainly due to privacy and ethics concerns. This lack of data is currently one of the main hurdles in developing better models. One possible option for alleviating this issue is generative modeling. Generative models are capable of simulating fake but realistic-looking data, also referred to as synthetic data, that can be shared more freely. Generative Adversarial Networks (GANs) are one such class of models, increasing our capacity to fit very high-dimensional distributions of data. While research on GANs is an active topic in fields like computer vision, they have found limited adoption within the human sciences, like economics and insurance. The reason for this is that in these fields, most questions are inherently about the identification of causal effects, while to this day neural networks, which are at the center of the GAN framework, focus mostly on high-dimensional correlations. In this paper, we study the causal preservation capabilities of GANs and whether the produced synthetic data can reliably be used to answer causal questions. This is done by performing causal analyses on the synthetic data, produced by a GAN, under increasingly lenient assumptions. We consider the cross-sectional case, the time series case and the case with a complete structural model. It is shown that in the simple cross-sectional scenario where correlation equals causation the GAN preserves causality, but that challenges arise for more advanced analyses.
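A minimal sketch of the cross-sectional check described above: estimate the same linear effect on real data and on synthetic data and compare the coefficients. The toy data-generating process and the sample_from_gan placeholder are assumptions standing in for a trained GAN, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# "real" data where correlation equals causation: y = 2*x + noise
x_real = rng.normal(size=5_000)
y_real = 2.0 * x_real + rng.normal(scale=0.5, size=5_000)

def estimate_effect(x, y):
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]                       # slope = estimated effect of x on y

def sample_from_gan(n):                  # placeholder standing in for a trained GAN sampler
    x = rng.normal(size=n)
    return x, 2.0 * x + rng.normal(scale=0.5, size=n)

x_syn, y_syn = sample_from_gan(5_000)
print("effect on real data:     ", round(estimate_effect(x_real, y_real), 3))
print("effect on synthetic data:", round(estimate_effect(x_syn, y_syn), 3))
```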
In addition to being a widely recognised novelist, Milan Kundera has also authored three pieces for theatre: The Owners of the Keys (Majitelé klíčů, 1961), The Blunder (Ptákovina, 1967), and Jacques and his Master (Jakub a jeho pán, 1971). In recent years, however, the hypothesis has been raised that Kundera is the true author of a fourth play: Juro Jánošík, first performed in a 1974 production under the name of Karel Steigerwald, who was Kundera's student at the time. In this study, we make use of supervised machine learning to settle the question of authorship attribution in the case of Juro Jánošík, with results strongly supporting the hypothesis of Kundera's authorship.
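A minimal sketch of this kind of supervised authorship attribution is given below, using character n-gram features and a linear classifier from scikit-learn; the feature choice and the toy training strings are illustrative assumptions, not the study's actual setup or corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# placeholder texts; a real study would use substantial excerpts from each candidate author
train_texts = ["text by Kundera ...", "another Kundera text ...",
               "text by Steigerwald ...", "another Steigerwald text ..."]
train_labels = ["Kundera", "Kundera", "Steigerwald", "Steigerwald"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),   # stylometric character n-grams
    LogisticRegression(max_iter=1000),
)
clf.fit(train_texts, train_labels)
print(clf.predict(["excerpt from Juro Jánošík ..."]))
```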
We consider the nonstochastic multi-agent multi-armed bandit problem with agents collaborating via a communication network with delays. We show a lower bound for individual regret of all agents. We show that with suitable regularizers and communication protocols, a collaborative multi-agent \emph{follow-the-regularized-leader} (FTRL) algorithm has an individual regret upper bound that matches the lower bound up to a constant factor when the number of arms is large enough relative to degrees of agents in the communication graph. We also show that an FTRL algorithm with a suitable regularizer is regret optimal with respect to the scaling with the edge-delay parameter. We present numerical experiments validating our theoretical results and demonstrate cases when our algorithms outperform previously proposed algorithms.
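For concreteness, a single follow-the-regularized-leader step over the probability simplex with a negative-entropy regularizer has the closed form of exponential weights. The sketch below shows that step for one agent, with the delayed, network-aggregated loss estimates abstracted into a single array; this interface is an assumption for illustration, not the paper's algorithm.

```python
import numpy as np

def ftrl_distribution(delayed_loss_estimates: np.ndarray, eta: float) -> np.ndarray:
    """argmin_{p in simplex} <L, p> + (1/eta) * sum_i p_i log p_i, in closed form."""
    logits = -eta * delayed_loss_estimates
    logits -= logits.max()                       # numerical stability
    w = np.exp(logits)
    return w / w.sum()

# cumulative (delayed) loss estimates over K = 4 arms, as seen by one agent
L = np.array([3.2, 2.8, 4.1, 2.9])
p = ftrl_distribution(L, eta=0.5)
arm = np.random.default_rng(0).choice(len(p), p=p)
print(p, arm)
```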
Cross entropy loss has served as the main objective function for classification-based tasks. Widely deployed for learning neural network classifiers, it shows both effectiveness and a probabilistic interpretation. Recently, after the success of self-supervised contrastive representation learning methods, supervised contrastive methods have been proposed to learn representations and have shown superior and more robust performance compared to training solely with cross entropy loss. However, cross entropy loss is still needed to train the final classification layer. In this work, we investigate the possibility of learning both the representation and the classifier using one objective function that combines the robustness of contrastive learning and the probabilistic interpretation of cross entropy loss. First, we revisit a previously proposed contrastive-based objective function that approximates cross entropy loss and present a simple extension to learn the classifier jointly. Second, we propose a new version of supervised contrastive training that jointly learns the parameters of the classifier and the backbone of the network. We empirically show that our proposed objective functions yield a significant improvement over the standard cross entropy loss, with more training stability and robustness in various challenging settings.
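To make the starting point concrete, the sketch below implements a standard supervised contrastive loss over a batch of labelled embeddings, in the spirit of the supervised contrastive methods referenced above; it is not the paper's proposed joint objective, and the temperature and batch layout are assumptions.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features: torch.Tensor, labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """features: (batch, dim) embeddings; labels: (batch,) integer class labels."""
    z = F.normalize(features, dim=1)
    sim = z @ z.t() / temperature                                    # pairwise similarities
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))                  # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)       # log-softmax over others
    pos_counts = pos_mask.sum(dim=1)
    has_pos = pos_counts > 0                                         # anchors with at least one positive
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
    return (-pos_log_prob[has_pos] / pos_counts[has_pos]).mean()

# toy usage: 8 embeddings, 4 classes with 2 samples each
feats = torch.randn(8, 128, requires_grad=True)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(supervised_contrastive_loss(feats, labels))
```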
Industrial Internet of Things (IoT) systems increasingly rely on wireless communication standards. In a common industrial scenario, indoor wireless IoT devices communicate with access points to deliver data collected from industrial sensors, robots and factory machines. Due to static or quasi-static locations of IoT devices and access points, historical observations of IoT device channel conditions provide a possibility to precisely identify the device without observing its traditional identifiers (e.g., MAC or IP address). Such device identification methods based on wireless fingerprinting have gained increased attention lately as an additional cyber-security mechanism for critical IoT infrastructures. In this paper, we perform a systematic study of a large class of machine learning algorithms for device identification using wireless fingerprints for the most popular cellular and Wi-Fi IoT technologies. We design, implement, deploy, collect relevant data sets, train and test a multitude of machine learning algorithms, as a part of the complete end-to-end solution design for device identification via wireless fingerprinting. The proposed solution is currently being deployed in a real-world industrial IoT environment as part of H2020 project COLLABS.
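An illustrative sketch of the device-identification setup is shown below: per-device channel fingerprints (random placeholder feature vectors here) are used to train and test a standard classifier with scikit-learn. Feature extraction from real channel observations is not shown, and the feature dimensionality and classifier choice are assumptions rather than the paper's evaluated pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
num_devices, samples_per_device, feat_dim = 10, 200, 64

# each device gets its own synthetic channel signature (placeholder for real fingerprints)
X = np.vstack([rng.normal(loc=i, scale=1.0, size=(samples_per_device, feat_dim))
               for i in range(num_devices)])
y = np.repeat(np.arange(num_devices), samples_per_device)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("identification accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```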
Image binarization techniques are commonly used to enhance noisy and/or degraded images for different Document Image Analysis (DIA) applications such as word spotting, document retrieval, and OCR. Most existing techniques focus on feeding pixel images into convolutional neural networks to accomplish document binarization, which may not produce effective results when the images to be processed are compressed and cannot be fully decompressed. Therefore, in this research paper, the idea of document image binarization directly on JPEG compressed images is proposed, using a Dual Discriminator Generative Adversarial Network (DD-GAN). Here, the two discriminator networks, global and local, work at different image ratios, and focal loss is used as the generator loss. The proposed model has been thoroughly tested on different versions of the DIBCO dataset, which pose challenges such as holes, erased or smudged ink, dust, and misplaced fibres. The model proved to be highly robust and efficient in terms of time and space complexity, and also achieved state-of-the-art performance in the JPEG compressed domain.
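As a reference point for the loss mentioned above, a minimal per-pixel focal loss for a binarization map is sketched below; the gamma/alpha values and the way the loss would be combined with the two adversarial discriminator terms are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0, alpha: float = 0.25) -> torch.Tensor:
    """logits, targets: (batch, 1, H, W); targets hold the binary ink/background map."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                                  # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()     # down-weights easy pixels

# toy usage on a fake 64x64 binarization map
logits = torch.randn(2, 1, 64, 64, requires_grad=True)
targets = (torch.rand(2, 1, 64, 64) > 0.5).float()
print(focal_loss(logits, targets))
```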